This notebook presents the exploratory data analysis process of the Mental Health Dataset through visualization.
The notebook utilizes code from MH_EDA.py and GeoBound_ChoroplethMap.py. These files can be found in the same folder as the notebook.
import MH_EDA as mh
import GeoBound_ChoroplethMap as gb
from IPython.display import Image
First, we load the mental health dataset, clean it, and remove less relevant features.
mh_file_path = 'MHDS/Original/500_Cities__City-level_Data__GIS_Friendly_Format___2017_release_20240514.csv'
raw_df = mh.mh_load_file(mh_file_path)
raw_df.head()
| | StateAbbr | PlaceName | PlaceFIPS | Population2010 | ACCESS2_CrudePrev | ACCESS2_Crude95CI | ACCESS2_AdjPrev | ACCESS2_Adj95CI | ARTHRITIS_CrudePrev | ARTHRITIS_Crude95CI | ... | SLEEP_Adj95CI | STROKE_CrudePrev | STROKE_Crude95CI | STROKE_AdjPrev | STROKE_Adj95CI | TEETHLOST_CrudePrev | TEETHLOST_Crude95CI | TEETHLOST_AdjPrev | TEETHLOST_Adj95CI | Geolocation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AL | Birmingham | 107000 | 212237 | 19.6 | (19.2, 20.0) | 19.8 | (19.5, 20.2) | 30.9 | (30.8, 31.1) | ... | (46.6, 47.0) | 5.2 | ( 5.1, 5.3) | 5.2 | ( 5.1, 5.2) | 26.1 | (25.1, 27.2) | 25.9 | (25.0, 26.9) | (33.52756637730, -86.7988174678) |
| 1 | AL | Hoover | 135896 | 81619 | 9.7 | ( 9.3, 10.1) | 9.9 | ( 9.5, 10.4) | 25.3 | (25.0, 25.7) | ... | (34.2, 35.0) | 2.2 | ( 2.1, 2.3) | 2.2 | ( 2.1, 2.2) | 9.6 | ( 8.6, 10.8) | 9.5 | ( 8.5, 10.9) | (33.37676027290, -86.8051937568) |
| 2 | AL | Huntsville | 137000 | 180105 | 15.1 | (14.7, 15.4) | 15.1 | (14.8, 15.5) | 27.5 | (27.3, 27.7) | ... | (39.4, 40.0) | 3.4 | ( 3.3, 3.4) | 3.3 | ( 3.2, 3.3) | 14.9 | (14.1, 15.7) | 14.7 | (13.8, 15.5) | (34.69896926710, -86.6387042882) |
| 3 | AL | Mobile | 150000 | 195111 | 16.9 | (16.6, 17.2) | 17.2 | (16.9, 17.5) | 30.5 | (30.3, 30.6) | ... | (42.0, 42.4) | 4.4 | ( 4.3, 4.5) | 4.1 | ( 4.1, 4.2) | 24.3 | (23.4, 25.3) | 24.1 | (23.1, 25.0) | (30.67762486480, -88.1184482714) |
| 4 | AL | Montgomery | 151000 | 205764 | 17.4 | (17.0, 17.9) | 17.5 | (17.1, 17.9) | 29.8 | (29.7, 30.0) | ... | (41.0, 41.5) | 4.1 | ( 4.1, 4.2) | 4.2 | ( 4.1, 4.3) | 21.2 | (20.3, 22.2) | 21.2 | (20.1, 22.2) | (32.34726453330, -86.2677059552) |
5 rows × 117 columns
We are not interested in other chronic diseases, so I will remove them and retain only the features related to mental health, along with other essential features.
mh_df = mh.mh_remove_chronics(raw_df)
mh_df.head()
| | StateAbbr | PlaceName | PlaceFIPS | Population2010 | MHLTH_CrudePrev | MHLTH_Crude95CI | MHLTH_AdjPrev | MHLTH_Adj95CI | Geolocation |
|---|---|---|---|---|---|---|---|---|---|
| 0 | AL | Birmingham | 107000 | 212237 | 15.6 | (15.4, 15.8) | 15.6 | (15.4, 15.8) | (33.52756637730, -86.7988174678) |
| 1 | AL | Hoover | 135896 | 81619 | 10.4 | (10.1, 10.7) | 10.4 | (10.1, 10.7) | (33.37676027290, -86.8051937568) |
| 2 | AL | Huntsville | 137000 | 180105 | 13.3 | (13.1, 13.6) | 13.4 | (13.2, 13.7) | (34.69896926710, -86.6387042882) |
| 3 | AL | Mobile | 150000 | 195111 | 14.9 | (14.7, 15.1) | 15.0 | (14.9, 15.2) | (30.67762486480, -88.1184482714) |
| 4 | AL | Montgomery | 151000 | 205764 | 14.9 | (14.7, 15.2) | 14.8 | (14.6, 15.1) | (32.34726453330, -86.2677059552) |
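The body of `mh.mh_remove_chronics` is not shown in this notebook. The sketch below illustrates one way such a column filter could work, assuming that every mental health measure shares the `MHLTH_` prefix (as the output above suggests). The helper name `select_mental_health_columns` and the `ID_COLUMNS` list are hypothetical and are not taken from MH_EDA.py.

```python
# Hypothetical sketch: decide which columns to keep, based on column names only.
# Assumption: identifying columns are listed explicitly, and every mental
# health measure starts with the prefix "MHLTH_".
ID_COLUMNS = ["StateAbbr", "PlaceName", "PlaceFIPS", "Population2010", "Geolocation"]


def select_mental_health_columns(columns, measure_prefix="MHLTH_"):
    """Return the column names to keep, preserving their original order."""
    return [
        name for name in columns
        if name in ID_COLUMNS or name.startswith(measure_prefix)
    ]


# A few column names taken from the raw table shown earlier.
sample_columns = [
    "StateAbbr", "PlaceName", "PlaceFIPS", "Population2010",
    "ACCESS2_CrudePrev", "ARTHRITIS_CrudePrev",
    "MHLTH_CrudePrev", "MHLTH_Crude95CI", "MHLTH_AdjPrev", "MHLTH_Adj95CI",
    "STROKE_CrudePrev", "Geolocation",
]
kept = select_mental_health_columns(sample_columns)
print(kept)
```

With a pandas DataFrame, the same list could be used for selection, for example `raw_df[select_mental_health_columns(raw_df.columns)]`.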
The dataset has been refined by excluding other chronic diseases, resulting in a DataFrame focused on mental health. The remaining features are explained in the table below:
| Features | Type | Meaning |
|---|---|---|
| StateAbbr | Plain Text | State abbreviation |
| PlaceName | Plain Text | City name |
| PlaceFIPS | Number | City FIPS Code |
| Population2010 | Number | 2010 Census population count |
| MHLTH_CrudePrev | Number | Crude prevalence of poor mental health for 14 days or more among adults aged 18 years and older, 2015. Crude prevalence represents the ratio of the total number of responses of 'not good' to the total number of valid responses (excluding those who refused to answer, provided no response, or indicated 'don’t know/not sure'). |
| MHLTH_Crude95CI | Plain Text | Estimated 95% confidence interval for crude prevalence |
| MHLTH_AdjPrev | Number | Age-adjusted prevalence, standardized by the direct method to the year 2000 standard U.S. population, distribution 9. [1] |
| MHLTH_Adj95CI | Plain Text | Estimated 95% Confidence interval for age-adjusted prevalence |
| Geolocation | Plain Text | Latitude, longitude of city centroid |
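To make the definition of crude prevalence in the table concrete, here is a minimal sketch. The function name and the survey counts are hypothetical, and the result is expressed as a percentage on the assumption that the dataset reports prevalence that way.

```python
def crude_prevalence(not_good, valid_responses):
    """Percentage of valid responses that reported 'not good' mental health."""
    if valid_responses <= 0:
        raise ValueError("valid_responses must be positive")
    return 100.0 * not_good / valid_responses


# Hypothetical survey counts, for illustration only.
total_surveyed = 1040
refused_or_unsure = 40          # excluded from the denominator
answered_not_good = 156

valid = total_surveyed - refused_or_unsure
print(crude_prevalence(answered_not_good, valid))
```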
Further cleaning and manipulation will be necessary as some features are less useful or stored in an incorrect format:
Removing Features:
Transforming Format:
[1] The direct method, aligned with the year 2000 standard U.S. population distribution 9, is a statistical technique that adjusts for age differences by assigning different weights to different age groups. The Department of Health and Human Services (DHHS) mandates this method across all its agencies to make age-adjusted rates more comparable among data systems (reference). "Distribution 9" indicates that this age-adjusted prevalence uses the weighting factors provided by Distribution 9. For more information about the weights, see page 3.
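The idea behind direct age standardization can be sketched as a weighted average of age-specific prevalences. The age groups, prevalences, and weights below are hypothetical placeholders, not the real Distribution 9 weights, and the function name is an assumption made for illustration.

```python
def age_adjusted_prevalence(age_specific_prev, standard_weights):
    """Weighted average of age-specific prevalences using standard-population weights."""
    total_weight = sum(standard_weights.values())
    weighted_sum = sum(
        age_specific_prev[group] * weight
        for group, weight in standard_weights.items()
    )
    return weighted_sum / total_weight


# Hypothetical age-specific prevalences (%) for one city.
city_prev = {"18-44": 16.0, "45-64": 14.0, "65+": 10.0}
# Hypothetical standard-population weights (NOT the actual Distribution 9 values).
weights = {"18-44": 0.5, "45-64": 0.3, "65+": 0.2}

adjusted = age_adjusted_prevalence(city_prev, weights)
print(round(adjusted, 2))
```

Because every city is weighted by the same standard population, two cities with identical age-specific prevalences get the same adjusted value even if their actual age structures differ.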
mhdf = mh.mh_secondary_remove_and_transform(mh_df)
mhdf.head()
| | StateAbbr | PlaceName | Population2010 | MHLTH_AdjPrev | MHLTH_Adj95CI | Geolocation |
|---|---|---|---|---|---|---|
| 0 | AL | Birmingham | 212237 | 15.6 | (15.4, 15.8) | [33.5275663773, -86.7988174678] |
| 1 | AL | Hoover | 81619 | 10.4 | (10.1, 10.7) | [33.3767602729, -86.8051937568] |
| 2 | AL | Huntsville | 180105 | 13.4 | (13.2, 13.7) | [34.6989692671, -86.6387042882] |
| 3 | AL | Mobile | 195111 | 15.0 | (14.9, 15.2) | [30.6776248648, -88.1184482714] |
| 4 | AL | Montgomery | 205764 | 14.8 | (14.6, 15.1) | [32.3472645333, -86.2677059552] |
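Comparing the two outputs, `Geolocation` changes from a text value such as `(33.52756637730, -86.7988174678)` to a list of two numbers. How `mh.mh_secondary_remove_and_transform` does this is not shown here; the sketch below is one possible way to perform the conversion, and the helper name `parse_geolocation` is hypothetical.

```python
import ast


def parse_geolocation(text):
    """Convert a '(lat, lon)' string into a [lat, lon] list of floats."""
    lat, lon = ast.literal_eval(text.strip())
    return [float(lat), float(lon)]


# Value copied from the first row of the table above.
print(parse_geolocation("(33.52756637730, -86.7988174678)"))
```

Using `ast.literal_eval` rather than `eval` keeps the parsing restricted to Python literals, which is a safer choice for text read from a file.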
I chose a treemap to present the overall status of all 500 cities instead of a bar chart because it effectively utilizes size and color to clearly depict the mental health prevalence in each city. Imagine trying to read data from a bar chart with 500 bars!
Plotly is an interactive visualization tool: hovering over or clicking on a box reveals detailed information about it.
Unfortunately, the interactive features are not available in the GitHub environment, so the pictures below are static snapshots that provide an overview. To explore the interactive features, download the visualizations using the following links and open them with any web browser:
fig_city = mh.mh_plotly_treemap(mhdf, city_level=True, title='Mental Health Prevalence by City')
# output png and html files
mh.output_visuals(fig_city, 'MHDS/Visuals/fig_city.png')
mh.output_visuals(fig_city, 'MHDS/Visuals/fig_city.html', tohtml = True)
# fig_city.show()
# disable the following line when running in a local environment:
Image(filename='MHDS/Visuals/fig_city.png')
According to the treemap above, the five cities with the most severe mental health issues are:
Among these five cities, three are located in Massachusetts (MA). This observation raises the question of whether Massachusetts has the most severe mental health issues among all U.S. states.
To explore this, I will use a treemap to present state-level data. Treemaps are effective for displaying hierarchical data and illustrating the status of states along with the relationship between cities and their parent state. By clicking on a state box in the treemap, users can view the average prevalence of mental health issues. This interactive approach can help identify states facing severe mental health challenges.
fig_statecity = mh.mh_plotly_treemap(mhdf)
# output png and html files
mh.output_visuals(fig_statecity, 'MHDS/Visuals/fig_statecity.png')
mh.output_visuals(fig_statecity, 'MHDS/Visuals/fig_statecity.html', tohtml = True)
# fig_statecity.show()
# disable the following line when running in a local environment:
Image(filename='MHDS/Visuals/fig_statecity.png')
Upon examining the treemap, Massachusetts (MA), with an average prevalence of 15.06, turns out to be the second most severely affected state. Ohio (OH) is the most affected, with an average prevalence of 15.37.
Despite this, Massachusetts has a higher number of cities with significant mental health challenges. Three out of 13 cities have a prevalence over 17, whereas Ohio has only one such city. However, the inclusion of more cities in Massachusetts's sample, some with lower prevalences like Newton (9.2), reduces the state's average. This variation in city selection can introduce significant bias if we analyze at a geographic level larger than the city.
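The dilution effect described above can be illustrated with a small sketch. The two states and all prevalence values below are hypothetical and chosen only to show how a state with more high-prevalence cities can still end up with the lower average.

```python
from statistics import mean

# Hypothetical city-level prevalences (%), for illustration only.
state_cities = {
    "State A": [17.5, 17.2, 17.1, 12.0, 9.2],   # more cities sampled, some low
    "State B": [17.3, 15.0, 14.5],              # fewer cities sampled
}
THRESHOLD = 17.0

summary = {
    state: {
        "mean": mean(values),
        "n_cities": len(values),
        "n_above": sum(v > THRESHOLD for v in values),
    }
    for state, values in state_cities.items()
}

for state, stats in summary.items():
    print(state, stats)
```

In this toy example, State A has three cities above the threshold against State B's one, yet its average is lower because the extra low-prevalence cities pull it down.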
Considering these variations, a question arises: could larger geographic units, such as regions and divisions, influence mental health?
How would geographic locations (regions and divisions) affect mental health prevalence?
Inference: Environmental conditions and socioeconomic status vary significantly among regions and divisions. Intuitively, locations with less green space and lower socioeconomic status might exhibit a higher prevalence of poor mental health. Thus, I infer that central areas of the US could have more severe mental health issues. However, the results might also be influenced by the data collection method, such as fewer data points from central areas.
Regardless, let's dive into the data. I will explore how mental health prevalence varies among regions and divisions using the choropleth map provided by the Folium package, which offers interactive functionalities, making the map more informative.
Both regional and divisional boundary data can be found and downloaded here.
# prepare the regional average df
# import us_region dictionary
us_region = gb.us_region()
# apply labels from dictionary to mhdf and output a grouped df by regions
mh_regions_avg = mh.mh_apply_boundary(mhdf, 'Regions', us_region)
mh_regions_avg
| | Regions | Population2010 | MHLTH_AdjPrev |
|---|---|---|---|
| 0 | Midwest | 185709.967742 | 12.143011 |
| 1 | Northeast | 301909.553571 | 14.125000 |
| 2 | South | 203285.878205 | 12.705128 |
| 3 | West | 190411.533333 | 11.875897 |
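The labelling and averaging step can be sketched as follows. The structure of the dictionary returned by `gb.us_region()` is not shown in this notebook, so the region-to-states mapping below is an assumed shape and deliberately partial. The prevalence values are the ones quoted earlier in this notebook for Birmingham, Hoover, and Newton.

```python
from collections import defaultdict
from statistics import mean

# Assumed shape of the region dictionary (partial, for illustration only).
REGION_TO_STATES = {
    "South": ["AL"],
    "Northeast": ["MA"],
}

# Invert the mapping so each state abbreviation points to its region.
state_to_region = {
    state: region
    for region, states in REGION_TO_STATES.items()
    for state in states
}

# (StateAbbr, PlaceName, prevalence) rows quoted elsewhere in this notebook.
cities = [
    ("AL", "Birmingham", 15.6),
    ("AL", "Hoover", 10.4),
    ("MA", "Newton", 9.2),
]

# Group city prevalences by region, then average each group.
by_region = defaultdict(list)
for state, _city, prevalence in cities:
    by_region[state_to_region[state]].append(prevalence)

region_avg = {region: mean(values) for region, values in by_region.items()}
print(region_avg)
```

Note that this is an unweighted mean over cities, so each city counts equally regardless of its population.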
# output geojson file of regions boundary if it does not exist
regions_path = 'MHDS/Original/cb_2018_us_region_500k/cb_2018_us_region_500k.shp'
regional_bound = gb.bound_load_file_output_geojson(regions_path, full_state = True, output = True, output_folder = 'MHDS/', output_filename ='region_gdf.geojson')
regional_bound.head()
Be aware of large dataset! File already exists.
| | REGIONCE | AFFGEOID | GEOID | NAME | LSAD | ALAND | AWATER | geometry |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0200000US1 | 1 | Northeast | 68 | 419357835545 | 50259300137 | MULTIPOLYGON (((-68.27472 44.25867, -68.27144 ... |
| 1 | 2 | 0200000US2 | 2 | Midwest | 68 | 1943997274253 | 184273267512 | MULTIPOLYGON (((-82.73571 41.60336, -82.73392 ... |
| 2 | 4 | 0200000US4 | 4 | West | 68 | 4536201747682 | 316587292459 | MULTIPOLYGON (((179.48246 51.98283, 179.48656 ... |
| 3 | 3 | 0200000US3 | 3 | South | 68 | 2249871668369 | 134084610547 | MULTIPOLYGON (((-75.56555 39.51485, -75.56174 ... |
# create regional choropleth map
m_regions = gb.choropleth_map('MHDS/region_gdf.geojson', mh_regions_avg)
display(m_regions)
Although the map suggests that the Northeast region has more severe mental health issues (14.1% on average), the regional map is not very informative. Therefore, we will explore further using the nine U.S. divisions.
# same as above, create divisional average df and boundary file
us_division = gb.us_division()
mh_division_avg = mh.mh_apply_boundary(mhdf, 'Divisions', us_division)
display(mh_division_avg.head())
division_path = 'MHDS/Original/cb_2018_us_division_500k/cb_2018_us_division_500k.shp'
divisional_bound = gb.bound_load_file_output_geojson(division_path, full_state = True, output = True, output_folder = 'MHDS/', output_filename ='division_gdf.geojson')
| | Divisions | Population2010 | MHLTH_AdjPrev |
|---|---|---|---|
| 0 | East North Central | 198099.590164 | 12.727869 |
| 1 | East South Central | 246005.875000 | 14.425000 |
| 2 | Middle Atlantic | 510628.560000 | 14.356000 |
| 3 | Mountain | 202059.040000 | 11.460000 |
| 4 | New England | 118945.137931 | 13.941379 |
Be aware of large dataset! File already exists.
# create the divisional choropleth map
m = gb.choropleth_map('MHDS/division_gdf.geojson', mh_division_avg, geo_col=['Divisions','MHLTH_AdjPrev'])
display(m)
From the map above, we can see that three divisions appear to have more severe mental health issues than other divisions: East South Central (14.42%), Middle Atlantic (14.36%), and New England (13.94%).
In fact, the entire Eastern region seems to experience more severe mental health issues compared to the Central and Western regions. Given that the Eastern area has a distinct environment and socioeconomic status compared to the Central and Western regions, this distinction provides a valuable starting point to further explore how environmental and socioeconomic factors correlate with mental health prevalence.
Does the size of a population affect mental health (MH) prevalence?
Inference: Population size, often a proxy for the size of a city, can influence mental health in two ways:
The relationship is complex, so let’s explore it using the dataset mhdf.
First, we need to sort cities into different size groups based on their population. Following the OECD Classification, we can categorize cities into four groups under a new column CitySize:
We will then apply this classification to mhdf and create a new DataFrame, df_CitySize, that includes the number of cities and the average MH prevalence for each city-size group (using groupby on CitySize). We may also create a squared MH prevalence column (square_MHLTH_AdjPrev) for better visualization.
Finally, we will use Altair (a visualization package that allows for flexible customizations) to create a visualization of Population vs. MH Prev combining a bar chart (presenting the number of cities for each group) and a scatter plot (presenting the average MH prevalence) to analyze the influence of population size on MH prevalence.
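The classification and grouping steps described above can be sketched in plain Python. This is an illustration under stated assumptions, not the code in MH_EDA.py: the helper `classify_city` is hypothetical, the bins are treated as half-open intervals (lower bound included, upper bound excluded), and `float("inf")` stands in for an open-ended upper limit. The five sample cities are the Alabama rows shown earlier.

```python
from statistics import mean

# Population thresholds for each city-size group (half-open: low <= pop < high).
CITY_SIZE_BINS = {
    "Small Urban Areas": (50_000, 200_000),
    "Medium-Size Urban Areas": (200_000, 500_000),
    "Metropolitan Areas": (500_000, 1_500_000),
    "Large Metropolitan Areas": (1_500_000, float("inf")),
}


def classify_city(population, bins=CITY_SIZE_BINS):
    """Return the city-size label for a population, or None if below all bins."""
    for label, (low, high) in bins.items():
        if low <= population < high:
            return label
    return None


# (PlaceName, Population2010, MHLTH_AdjPrev) from the table shown earlier.
cities = [
    ("Birmingham", 212237, 15.6),
    ("Hoover", 81619, 10.4),
    ("Huntsville", 180105, 13.4),
    ("Mobile", 195111, 15.0),
    ("Montgomery", 205764, 14.8),
]

# Group prevalences by city-size label.
groups = {}
for _name, population, prevalence in cities:
    groups.setdefault(classify_city(population), []).append(prevalence)

# Per group: number of cities, mean prevalence, and the squared mean.
summary = {
    label: {
        "n_cities": len(values),
        "mean_prev": mean(values),
        "mean_prev_squared": mean(values) ** 2,
    }
    for label, values in groups.items()
}
print(summary)
```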
city_size_dict = {
'Small Urban Areas': [50000,200000],
'Medium-Size Urban Areas': [200000,500000],
'Metropolitan Areas': [500000,1500000],
'Large Metropolitan Areas': [1500000, 100**100]
}
# call function to add CitySize col to mhdf and output a grouped df
df_CitySize = mh.mh_apply_CitySize(mhdf, city_size_dict)
df_CitySize
| | CitySize | Population2010 | MHLTH_AdjPrev | square_MHLTH_AdjPrev |
|---|---|---|---|---|
| 0 | Large Metropolitan Areas | 5 | 12.880000 | 165.894400 |
| 1 | Medium-Size Urban Areas | 73 | 12.549315 | 157.485309 |
| 2 | Metropolitan Areas | 29 | 12.458621 | 155.217229 |
| 3 | Small Urban Areas | 392 | 12.410204 | 154.013165 |
# call function to present the Population vs. MH Prev
mh.mh_pop_vs_mh(df_CitySize)
Unfortunately, according to the chart above, the size of the population seems uncorrelated with mental health prevalence.
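The same conclusion can be checked numerically from the grouped table above. The sketch below copies the group averages and city counts from that table and computes the gap between the highest and lowest group average; it is a rough check, not a formal test of correlation.

```python
# Group averages (MHLTH_AdjPrev) and city counts copied from the grouped table.
group_means = {
    "Large Metropolitan Areas": 12.880000,
    "Medium-Size Urban Areas": 12.549315,
    "Metropolitan Areas": 12.458621,
    "Small Urban Areas": 12.410204,
}
group_counts = {
    "Large Metropolitan Areas": 5,
    "Medium-Size Urban Areas": 73,
    "Metropolitan Areas": 29,
    "Small Urban Areas": 392,
}

highest = max(group_means, key=group_means.get)
lowest = min(group_means, key=group_means.get)
spread = group_means[highest] - group_means[lowest]

print(f"Highest: {highest} (n={group_counts[highest]})")
print(f"Lowest: {lowest} (n={group_counts[lowest]})")
print(f"Spread between group averages: {spread:.2f} percentage points")
```

The gap is under half a percentage point, and the highest average comes from the group with only five cities, so it offers little evidence of a relationship.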
I think it would be better to delve deeper into accessibility to green spaces and socioeconomic status instead of focusing on population size.